Journal of Cheminformatics

Springer Science and Business Media LLC

All preprints, ranked by how well they match Journal of Cheminformatics's content profile, based on 25 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
PanScreen: A Comprehensive Approach to Off-Target Liability Assessment

Sellner, M. S.; Lill, M. A.; Smiesko, M.

2023-11-17 bioinformatics 10.1101/2023.11.16.567496 medRxiv
Top 0.1%
19.8%

Drug development projects are becoming increasingly expensive while their success rate is stagnating. Safety issues attributed to off-target binding represent a major reason for the failure of new drugs. Besides desired on-target binding, small molecules may interact with off-targets, triggering adverse effects. Therefore, the development of novel, resource-efficient and cost-effective methods for early recognition of such issues becomes vital. Here, we introduce PanScreen, an online platform for the automated assessment of off-target liabilities. PanScreen combines structure-based modeling techniques with state-of-the-art deep learning methods to not only predict accurate binding affinities but also give insight into potential modes of action. We show that the predictions are approaching the experimental accuracy found in public datasets and that the same technology can also be used for other research areas, such as drug repurposing. Such fast and inexpensive methods allow researchers to test not only drug candidates, but all small molecules that might come into contact with a human organism, for potential safety concerns very early in the development process. PanScreen is publicly available at www.panscreen.ch.

2
Improving the assessment of deep learning models in the context of drug-target interaction prediction

Torrisi, M.; de la Vega de Leon, A.; Climent, G.; Loos, R.; Panjkovich, A.

2022-04-21 bioinformatics 10.1101/2022.04.20.488898 medRxiv
Top 0.1%
19.8%

Machine learning techniques have been widely adopted to predict drug-target interactions, a central area of research in early drug discovery. These techniques have shown promising results on various benchmarks, although they tend to suffer from poor generalization. This is typically related to the very sparse and nonuniform datasets available, which limit the applicability domain of machine learning techniques. Moreover, widespread approaches to splitting datasets (into training and test sets) treat each drug-target interaction as an independent entity, when in reality the drug and target involved may take part in other interactions, breaking the assumption of independence. We observe that this leads to overly optimistic test results and poor generalization to out-of-distribution samples for various state-of-the-art sequence-based machine learning models for drug-target prediction. We show that previous approaches to reduce bias in binding datasets focus on drug or target information only and, thus, lead to similar pitfalls. Finally, we propose a minimum viable solution to evaluate the generalization capability of a machine learning model based on the systematic separation of test samples with respect to drugs and targets in the training set, thus discerning the three out-of-distribution scenarios seen at test time: (1) drug or (2) target present in the training set, or (3) neither.
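
The separation of test samples described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `categorize_test_pairs`, the label strings, and the representation of interactions as `(drug, target)` tuples are assumptions made for illustration. Each test pair is labelled by whether its drug, its target, both, or neither occur in the training set; the last three labels correspond to the three out-of-distribution scenarios.

```python
def categorize_test_pairs(train_pairs, test_pairs):
    """Label each test (drug, target) pair by its overlap with the training set.

    Hypothetical helper: names and labels are illustrative assumptions.
    """
    train_drugs = {drug for drug, _ in train_pairs}
    train_targets = {target for _, target in train_pairs}

    labels = []
    for drug, target in test_pairs:
        seen_drug = drug in train_drugs
        seen_target = target in train_targets
        if seen_drug and seen_target:
            labels.append("both_seen")          # in-distribution with respect to entities
        elif seen_drug:
            labels.append("drug_seen_only")     # scenario (1): unseen target
        elif seen_target:
            labels.append("target_seen_only")   # scenario (2): unseen drug
        else:
            labels.append("neither_seen")       # scenario (3): both unseen
    return labels


# Toy identifiers, invented for this example.
train = [("d1", "t1"), ("d2", "t2")]
test = [("d1", "t2"), ("d1", "t3"), ("d3", "t1"), ("d3", "t3")]
labels = categorize_test_pairs(train, test)
```

Reporting a metric separately for each label, rather than one pooled score, is what makes the optimism of a random split visible.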

3
Bento: Benchmarking Classical and AI Docking on Drug Design-Relevant Data

Pak, M. A.; Frolova, D.; Nikolenko, S. A.; Daulbaev, T.; Ryabchenko, D.; Litvin, A.; Gurevich, P.; Garifullin, K.; Shapeev, A.; Oseledets, I.; Ivankov, D. N.

2025-12-30 bioinformatics 10.64898/2025.12.30.696741 medRxiv
Top 0.1%
19.8%

Recent advances in artificial intelligence have introduced deep learning and co-folding approaches for predicting protein-ligand complexes, raising the question of their applicability and how they compare with classical docking methods. In this work, we present a thorough benchmarking study of eleven tools for protein-ligand interaction prediction, spanning classical molecular docking methods, deep learning-based models, and co-folding algorithms. While most related benchmarking efforts primarily assess the generalization capacity, we extend the analysis to also evaluate the performance on drug design-relevant data and across different classes of protein-ligand complexes. Here, we introduce Bento, a comprehensive benchmark that evaluates 11 tools for protein-ligand interaction prediction - both established and recently developed - across four test datasets and multiple derived subsets in a pocket-aware setup. We show that 1) careful dataset curation is essential - filtering by pocket structural similarity and controlling ligand complexity exposes generalization failures that are obscured in conventional benchmarks; 2) classical and deep learning-based docking tools perform similarly well on drug-like ligands, making them comparably useful for virtual screening, with physics-based methods offering a clear advantage in speed; 3) co-folding tools outperform other approaches on structurally complex ligands, whereas most methods achieve similar accuracy on regular small molecules; and 4) all methods struggle to generalize to unseen pockets, with deep learning models being the most prone to overfitting. Overall, our results show that while current docking and DL-based approaches are reliable for many drug-design-relevant scenarios, genuine pocket-level generalization remains an open challenge.
Bento provides a rigorous and transparent framework for diagnosing these limitations and guiding the development of more robust protein-ligand prediction models. The data and code of Bento are available at https://github.com/LigandPro/Bento.

4
CPSign - Conformal Prediction for Cheminformatics Modeling

McShane, S. A.; Norinder, U.; Alvarsson, J.; Ahlberg, E.; Carlsson, L.; Spjuth, O.

2023-11-22 bioinformatics 10.1101/2023.11.21.568108 medRxiv
Top 0.1%
18.9%

Conformal prediction has seen many applications in pharmaceutical science, being able to calibrate outputs of machine learning models and producing valid prediction intervals. We here present the open source software CPSign that is a complete implementation of conformal prediction for cheminformatics modeling. CPSign implements inductive and transductive conformal prediction for classification and regression, and probabilistic prediction with the Venn-ABERS methodology. The main chemical representation is signatures but other types of descriptors are also supported. The main modeling methodology is support vector machines (SVMs), but additional modeling methods are supported via an extension mechanism, e.g. DeepLearning4j models. We also describe features for visualizing results from conformal models including calibration and efficiency plots, as well as features to publish predictive models as REST services. We compare CPSign against other common cheminformatics modeling approaches including random forest, and a directed message-passing neural network. The results show that CPSign produces robust predictive performance with comparative predictive efficiency, with superior runtime and lower hardware requirements compared to neural network based models. CPSign has been used in several studies and is in production-use in multiple organizations. The ability to work directly with chemical input files, perform descriptor calculation and modeling with SVM in the conformal prediction framework, with a single software package having a low footprint and fast execution time makes CPSign a convenient and yet flexible package for training, deploying, and predicting on chemical data.
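
As a rough illustration of the inductive conformal regression mentioned in this abstract, the sketch below builds a prediction interval from absolute residuals on a held-out calibration set. It is a generic textbook-style construction and not CPSign's API; the function name `conformal_interval`, its parameters, and the toy numbers are assumptions made for this example.

```python
import math


def conformal_interval(calib_true, calib_pred, new_pred, confidence=0.8):
    """Inductive (split) conformal interval using absolute-residual scores.

    Hypothetical helper for illustration; not part of CPSign.
    """
    # Nonconformity score: absolute error of the underlying model on each
    # calibration example.
    scores = sorted(abs(y - p) for y, p in zip(calib_true, calib_pred))
    n = len(scores)

    # Rank of the calibration score that serves as the interval half-width.
    k = math.ceil((n + 1) * confidence)
    if k > n:
        # Too few calibration points for this confidence level: the only
        # valid interval is unbounded.
        return (-math.inf, math.inf)

    half_width = scores[k - 1]
    return (new_pred - half_width, new_pred + half_width)


# Toy calibration data, invented for this example.
calib_true = [1.0, 2.0, 3.0, 4.0]
calib_pred = [1.5, 2.0, 2.0, 4.25]
interval = conformal_interval(calib_true, calib_pred, new_pred=10.0, confidence=0.5)
```

The interval width here is the same for every new compound; normalized nonconformity scores, which scale the width per example, are a common refinement that this sketch leaves out.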

5
Benchmarking ensemble docking methods as a scientific outreach project

Gan, J. L.; Kumar, D.; Chen, C.; Taylor, B. C.; Jagger, B. R.; Amaro, R. E.; Lee, C. T.

2020-10-04 scientific communication and education 10.1101/2020.10.02.324343 medRxiv
Top 0.1%
18.8%

The discovery of new drugs is a time-consuming and expensive process. Methods such as virtual screening, which can filter out ineffective compounds from drug libraries prior to expensive experimental study, have become popular research topics. As the computational drug discovery community has grown, organizations such as the Drug Design Data Resource have begun hosting blinded grand challenges to benchmark the various advances in methodology, seeking to identify the best methods for ligand pose prediction, ligand affinity ranking, and free energy calculations. Such open challenges offer a unique opportunity for researchers to partner with junior students (e.g., high school and undergraduate) to validate basic yet fundamental hypotheses considered to be uninteresting to domain experts. Here, we, a group of high school-aged students and their mentors, present the results of our participation in Grand Challenge 4, where we predicted ligand affinity rankings for the Cathepsin S (CatS) protease, an important protein target for autoimmune diseases. To investigate the effect of incorporating receptor dynamics on ligand affinity rankings, we employed the Relaxed Complex Scheme, a molecular docking method paired with molecular dynamics-generated receptor conformations. We found that CatS is a difficult target for molecular docking, and we explore some advanced methods such as distance-restrained docking to try to improve the correlation with experiments. This project has exemplified the capabilities of high school students when supported with a rigorous curriculum, and demonstrates the value of community-driven competitions for beginners in computational drug discovery.

6
Establishing the foundations for a data-centric AI approach for virtual drug screening through a systematic assessment of the properties of chemical data

Chong, A.; Phua, S.-X.; Xiao, Y.; Ng, W. Y.; Li, H. Y.; Goh, W. W. B.

2024-03-31 cancer biology 10.1101/2024.03.28.587184 medRxiv
Top 0.1%
18.5%

Researchers have adopted model-centric artificial intelligence (AI) approaches in cheminformatics by using newer, more sophisticated AI methods to take advantage of growing chemical libraries. It has been shown that complex deep learning methods outperform conventional machine learning (ML) methods in QSAR and ligand-based virtual screening [1-3], but such approaches generally lack explainability. Hence, instead of developing more sophisticated AI methods (i.e., pursuing a model-centric approach), we wanted to explore the potential of a data-centric AI paradigm for virtual screening. A data-centric AI is an intelligent system that would automatically identify the right type of data to collect, clean and curate for later use by a predictive AI, and this is required given the large volumes of chemical data that exist in chemical databases - PubChem alone has over 100 million unique compounds. However, a systematic assessment of the attributes and properties of suitable data is needed. We show here that it is not deficiencies in current AI algorithms but rather poor understanding and erroneous use of chemical data that ultimately lead to poor predictive performance. Using a new benchmark dataset of BRAF ligands that we developed, we show that our best performing predictive model can achieve an unprecedented accuracy of 99% with a conventional ML algorithm (SVM) using a merged molecular representation (Extended + ECFP6 fingerprints), far surpassing past performances of virtual screening platforms using sophisticated deep learning methods. Thus, we demonstrate that it is not necessary to resort to the use of sophisticated deep learning algorithms for virtual screening because conventional ML can perform exceptionally well if given the right data and representation. We also show that the common use of decoys for training leads to high false positive rates and that their use for testing will result in an over-optimistic estimation of a model's predictive performance.
Another common practice in virtual screening is defining compounds that are above a certain pharmacological threshold as inactives. Here, we show that the use of these so-called inactive compounds lowers a model's sensitivity/recall. Considering that some target proteins have a limited number of known ligands, we wanted to also observe how the size and composition of the training data impact predictive performance. We found that an imbalanced training dataset where inactives outnumber actives led to a decrease in recall but an increase in precision, regardless of the model or molecular representation used; and overall, we observed a decrease in the model's accuracy. We highlight in this study some of the considerations that one needs to take into account in future development of data-centric AI for CADD.

7
Dynamic applicability domain (dAD) for compound-target binding affinity prediction with confidence guarantees

Orsolic, D.; Smuc, T.

2022-08-23 bioinformatics 10.1101/2022.08.22.504786 medRxiv
Top 0.1%
18.4%

Increasing efforts are being made in the field of machine learning to advance the learning of robust and accurate models from experimentally measured data and enable more efficient drug discovery processes. The prediction of binding affinity is one of the most frequent tasks of compound bioactivity modelling. Learned models for binding affinity prediction are assessed by their average performance on unseen samples, but point predictions are typically not provided with a rigorous confidence assessment. Approaches such as the conformal prediction framework equip conventional models with a more rigorous assessment of confidence for individual point predictions. In this paper, we extend the inductive conformal prediction (ICP) framework to dyadic data, such as the compound-target binding affinity prediction task. The new framework is based on dynamically defined calibration sets that are specific to each testing interaction pair and provides prediction assessment in the context of calibration pairs from its compound-target neighbourhood, enabling improved guarantees based on local properties of the prediction model. The effectiveness of the approach is benchmarked on several publicly available datasets and through testing in more realistic scenarios with increasing levels of difficulty on a bespoke, complex compound-target binding affinity space. We demonstrate that in such scenarios, a novel approach combining the applicability domain paradigm with the conformal prediction framework produces superior confidence assessment with informative prediction regions compared to other state-of-the-art conformal prediction approaches.

8
Bioschemas Training Profiles: A set of specifications for standardizing training information to facilitate the discovery of training programs and resources

Jael Castro, L.; Palagi, P. M.; Beard, N.; Bioschemas Training Profiles Group Members; ELIXIR FAIR Training Focus Group; The GOBLET Foundation; Attwood, T. K.; Brazas, M. D.

2022-11-29 scientific communication and education 10.1101/2022.11.24.516513 medRxiv
Top 0.1%
18.3%

Stand-alone life science training events and e-learning solutions are amongst the most sought-after modes of training because they address both point-of-need learning and the limited timeframes available for upskilling. Yet, finding relevant life sciences training courses and materials is challenging because such resources are not marked up for Internet searches in a consistent way. This absence of mark-up standards to facilitate discovery, re-use and aggregation of training resources limits their usefulness and knowledge-translation potential. Through a joint effort between the Global Organisation for Bioinformatics Learning, Education and Training (GOBLET), the Bioschemas Training community and the ELIXIR FAIR Training Focus Group, a set of Bioschemas Training profiles has been developed, published and implemented for life sciences training courses and materials. Here, we describe our development approach and methods, which were based on the Bioschemas model, and present the results for the three Bioschemas Training profiles: TrainingMaterial, Course and CourseInstance. Several implementation challenges were encountered, which we discuss alongside potential solutions. Over time, continued implementation of these Bioschemas Training profiles by training providers will lower the barriers to skill development, facilitating both the discovery of relevant training events to meet individuals' learning needs, and the discovery and re-use of training and instructional materials.

9
Enabling automatic generation of protein-ligand complex datasets with atomistic detail

Gutermuth, T.; Ehmki, E. S. R.; Flachsenberg, F.; Penner, P.; Hoenig, S. M. N.; Harren, T.; Rarey, M.

2026-01-15 bioinformatics 10.64898/2026.01.15.699426 medRxiv
Top 0.1%
17.4%

Predicting protein-ligand bioactivities is known to be challenging yet crucial in any drug discovery project. In a protein structure-based scenario, supervised machine-learning models have been highly competitive for at least 30 years. Regardless of the machine-learning method used, dataset size and quality are key aspects in model training and validation. In general, datasets are the foundation upon which accurate performance estimates can be obtained. While well-curated repositories exist for bioactivity and protein structure data, combining these two types of data is particularly challenging. With ActivityFinder, we recently introduced a fully-automated process for linking these data sources relying on protein sequence and molecular structure only. By combining ActivityFinder with previously developed tools for structure quality estimation and property calculation, we created StrAcTable, an automatically constructed dataset of annotated protein-ligand complexes. The automated procedure allows for continued and sustainable growth. StrAcTable includes detailed descriptions of the quality of matching between ChEMBL and PDB, of the macromolecular structure, small-molecule ligands bound, and bioactivity data from ChEMBL. Based on ChEMBL Version 35, the StrAcTable contains 20 063 protein-ligand complexes with bioactivity values, enabling an efficient construction of training and validation datasets for structure-based molecular design method development.

10
Convex-PLR - Revisiting affinity predictions and virtual screening using physics-informed machine learning

Kadukova, M.; Chupin, V.; Grudinin, S.

2021-09-15 bioinformatics 10.1101/2021.09.13.460049 medRxiv
Top 0.1%
15.0%

Virtual screening is an essential part of the modern drug design pipeline, which significantly accelerates the discovery of new drug candidates. Structure-based virtual screening involves ligand conformational sampling, which is often followed by re-scoring of docking poses. A great variety of scoring functions have been designed for this purpose. The advent of structural and affinity databases and the progress in machine-learning methods have recently boosted scoring function performance. Nonetheless, the most successful scoring functions are typically designed for specific tasks or systems. All-purpose scoring functions still perform poorly on virtual screening tests compared to the precision with which they are able to predict co-crystal binding poses. Another limitation is the low interpretability of the heuristics being used. We analyzed scoring function performance in the CASF benchmarks and discovered that the vast majority of them have a strong bias towards predicting larger binding interfaces. This motivated us to develop a physical model with additional entropic terms with the aim of penalizing such a preference. We parameterized the new model using affinity and structural data, solving a classification problem followed by regression. The new model, called Convex-PLR, demonstrated high-quality results on multiple tests and a substantial improvement over its predecessor Convex-PL. Convex-PLR can be used for molecular docking together with VinaCPL, our version of AutoDock Vina, with Convex-PL integrated as a scoring function. Convex-PLR, Convex-PL, and VinaCPL are available at https://team.inria.fr/nano-d/convex-pl/.

11
AutoLead: An LLM-Guided Bayesian Optimization Framework for Multi-Objective Lead Optimization

Zhang, Y.; Choong, J. j.; Ozawa, K.

2025-08-23 bioinformatics 10.1101/2025.08.19.671029 medRxiv
Top 0.1%
14.8%

The process of lead optimization in drug discovery is a complex, multi-objective challenge that remains a major bottleneck in the development of new therapeutics. Traditional approaches often struggle to efficiently explore the vast chemical space while simultaneously optimizing multiple, and sometimes conflicting, molecular properties. In this work, we present AutoLead, a novel framework that integrates Large Language Models (LLMs) with multi-objective Bayesian optimization to tackle this challenge. By leveraging the chemical reasoning capabilities of LLMs, AutoLead effectively guides the search for novel drug-like molecules that satisfy multiple objectives. We evaluate our approach on two molecule optimization tasks, achieving state-of-the-art results. Furthermore, we introduce a new benchmark dataset designed around a more realistic lead optimization scenario, where the task is to modify compounds that violate Lipinski's Rule of Five to simultaneously meet all criteria and improve their QED score. Through extensive experiments and a detailed case study, we demonstrate the potential of combining LLMs with black-box optimization techniques for more efficient and practical drug discovery.
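
For readers unfamiliar with the Rule of Five criterion that this benchmark is built around, a minimal sketch follows. It uses the standard thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) and assumes the descriptor values have already been computed by some cheminformatics toolkit. The function name `lipinski_violations` and the example values are hypothetical, and how the benchmark itself computes descriptors or counts violations is not stated in the abstract.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count Rule of Five violations from precomputed descriptors.

    Hypothetical helper; descriptor calculation is assumed to happen elsewhere.
    """
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # octanol-water partition coefficient above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


# Invented descriptor values for two illustrative compounds.
compliant = lipinski_violations(mol_weight=350.0, logp=2.1, h_donors=2, h_acceptors=5)
violating = lipinski_violations(mol_weight=720.0, logp=6.3, h_donors=6, h_acceptors=12)
```

A multi-objective optimizer would then treat "zero violations" as a constraint while maximizing a separate objective such as QED.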

12
AlphaFold3 in Drug Discovery: A Comprehensive Assessment of Capabilities, Limitations, and Applications

Zheng, H.; Wang, J.

2025-04-08 biochemistry 10.1101/2025.04.07.647682 medRxiv
Top 0.1%
14.8%

Accurate prediction of protein-ligand interactions remains a cornerstone challenge in drug discovery. AlphaFold3 (AF3), a recent breakthrough Diffusion Transformer model, holds significant promise for structural biology, but its performance across diverse pharmaceutical applications requires systematic evaluation. In this study, we comprehensively benchmark AF3's capabilities using carefully curated datasets, examining its performance in binary protein-ligand complexes, apo/holo structural variations, GPCR-ligand conformations, ternary systems, and inhibitor affinity prediction. Our analysis reveals that AF3 excels at predicting static protein-ligand interactions with minimal conformational changes, significantly outperforming traditional docking methods in side-chain orientation accuracy. However, we identify critical limitations: AF3 struggles with protein-ligand complexes involving significant conformational changes (>5 Å RMSD), demonstrates a persistent bias toward active GPCR conformations regardless of ligand type, performs poorly on ternary complex prediction, and lacks reliable affinity ranking capability. Notably, AF3's performance declined significantly on structures released after its training cutoff date, suggesting potential memorization rather than physical understanding of molecular interactions. We explored AF3's practical utility through applications in chemoproteomics data interpretation, drug resistance mutation prediction, and kinome profiling simulation. AF3 demonstrated value as a "true-hit binary interaction modeler," capable of generating reliable structural models for experimentally validated binding pairs. However, its ranking metrics showed minimal correlation with experimental binding affinities and limited ability to differentiate across the kinome, highlighting the need for integration with physics-based scoring methods.
Our findings indicate that while AF3 represents a significant advancement in protein-ligand structure prediction, it requires complementary approaches to address its limitations in conformational sampling, affinity ranking, and complex system modeling. Recent developments like YDS Ternoplex suggest that enhanced sampling techniques can overcome some of these limitations. The optimal strategy for leveraging AF3 in drug discovery likely involves its integration into hybrid computational pipelines that combine AI-based prediction with physics-based refinement and experimental validation.

13
Predicting small-molecule inhibition of protein complexes

Yaseen, A.; Roy, S.; Akhter, N.; Ben-Hur, A.; Minhas, F.

2024-08-23 bioinformatics 10.1101/2024.08.23.609286 medRxiv
Top 0.1%
14.6%

Motivation: Protein-Protein Interactions (PPIs) are crucial in biological processes and disease mechanisms, underscoring the importance of discovering PPI inhibitors in drug development. Machine learning can expedite this discovery process. Although machine learning techniques for predicting general compound inhibition are available, we are not aware of any that accurately forecast the inhibitory effect of a compound on a specific protein complex, utilizing inputs from both the compound and the protein complex. Methods: We present the first targeted machine learning based predictor of small molecule based inhibition of protein complexes. Our proposed graph neural network integrates the structure of a protein complex, its protein-protein binding site or interface features and a compound's SMILES representation to predict the potential of the given compound to inhibit the interaction between proteins in the given complex in a targeted manner. Results: Validated on the 2p2i-DB-v2 database, encompassing 714 inhibitors across 23 complexes with over 12,000 instances, our model achieves superior predictive accuracy (cross-validation AUC-ROC of 0.86), outperforming established kernel methods and pre-trained neural networks. We further tested the predictive performance of our model on two independent external datasets - one collected from recent publications and another consisting of putative inhibitors of the SARS-CoV-2-Spike and Human-ACE2 protein complex - with AUC-ROCs of 0.82 and 0.78, respectively. Our targeted predictor introduces a novel approach for PPI inhibitor discovery, laying foundational work for future advancements in addressing this complex and previously unexplored prediction challenge. Availability: Code/supplementary material available: https://github.com/adibayaseen/PPI-Inhibitors

14
Predicting Degradation Potential of Protein Targeting Chimeras

Petrou, A.; Minhas, F.

2024-09-19 bioinformatics 10.1101/2024.09.16.613208 medRxiv
Top 0.1%
14.5%

PRoteolysis TArgeting Chimeras (PROTACs) can inhibit protein activity by utilizing natural proteasomal degradation pathways for the degradation of target proteins. Being able to determine the degradation potential of PROTACs is crucial in drug development as it can lead to time, labor and cost savings. In this paper, we present a novel machine-learning pipeline that utilizes common compound fingerprints and a pre-trained graph neural network for the prediction of half-maximal degradation concentration of PROTACs by benchmarking a variety of protein tertiary structures and chemical features. Based on critical analysis of our cross-validation and independent test results, we have highlighted several key challenges underlying this prediction problem that need to be addressed to improve the generalization of predictive models in this domain. Moreover, we demonstrate the effectiveness of our approach by testing it on two different datasets and show that it performs better than the current state of the art with an AUC-ROC of 0.85 and accuracy of 0.875 on the DeepPROTACs test dataset.

15
Bio-Mol: Pretraining Multimodality Bioactivity Profile for Enhancing Small Molecule Property Prediction

Yip, H. F.; Wei, X.; Li, Z.; Ren, Q.; Cao, D.; Zhang, L.; Lu, A.

2023-11-05 bioinformatics 10.1101/2023.11.02.565401 medRxiv
Top 0.1%
14.4%

Non-optimized pharmacokinetic parameters serve as the primary cause of failure in clinical trials of drugs. Therefore, the successful prediction of pharmacokinetic parameters during the pre-clinical stage is crucial for the success of drug candidates. Conventional methods primarily rely on 2D structural information, while advanced models extend the features to other structure-related information or use advanced computational models to improve prediction accuracy. However, to gain a comprehensive understanding of small molecules, integrating bioactivity profiles with chemical structural information is essential. One significant challenge in this integration is the high proportion of missing values within experimentally validated bioactivity profiles for most small molecules. To address this challenge, we introduce Bio-Mol, an artificial intelligence model designed to effectively handle this issue. Bio-Mol utilizes a pretrain and finetune strategy, enabling the incorporation of bioactivity profiles with a large proportion of missing values during the small molecule representation learning process. Comprehensive evaluations of Bio-Mol demonstrate a notable improvement in predicting molecule properties. The integration of incomplete bioactivity profiles enhances the AUROC by an average of 5.2% compared to the predictions of previous state-of-the-art models. Furthermore, we explore the potential of Bio-Mol in predicting synergistic drug combinations, highlighting its versatility and broader applications in the field of drug discovery. The successful implementation of Bio-Mol showcases its efficacy in overcoming the challenges posed by missing bioactivity profile data. This model paves the way for optimizing small molecule pharmacokinetics prediction, providing valuable insights for drug development and discovery processes.

16
Accelerating ligand discovery by combining Bayesian optimization with MMGBSA-based binding affinity calculations

Andersen, L.; Rausch-Dupont, M.; Martinez Leon, A.; Volkamer, A.; Hub, J.; Klakow, D.

2025-11-13 bioinformatics 10.1101/2025.06.22.660936 medRxiv
Top 0.1%
14.2%

Predicting protein-ligand binding affinity with high accuracy is critical in structure-based drug discovery. While docking methods offer computational efficiency, they often lack the precision required for reliable affinity ranking. In contrast, molecular dynamics (MD)-based approaches such as MMGBSA provide more accurate binding free energy estimates but are computationally intensive, limiting their scalability. To address this trade-off, we introduce an active learning framework that automates molecule selection for docking and MD simulations, replacing manual expert-driven decisions with a data-efficient, model-guided strategy. Our approach integrates fixed molecular representations, some of them from pre-trained deep learning models (MolFormer, ChemBERTa-2, and Morgan fingerprints), with adaptive regression models (e.g. Bayesian Ridge and Random Forest) to iteratively improve binding affinity predictions. We evaluate this approach retrospectively on a new dataset of 59,356 chemically diverse compounds from ZINC-22 targeting the MCL1 protein using both AutoDock Vina and MMGBSA binding free energy scores. Our results show that incorporating MMGBSA scores into the active learning loop significantly enhances performance, recovering 79.9% of the top 1% binders in the whole dataset, compared to only 6.7% when using docking scores alone. Notably, MMGBSA exhibits a stronger correlation with experimental binding affinities than AutoDock Vina on our dataset and enables more accurate ranking of candidate compounds in a runtime-efficient way. Furthermore, we demonstrate that a one-at-a-time acquisition active learning strategy consistently outperforms traditional batched acquisition, the latter achieving just 78.4% recovery with MolFormer and Bayesian Ridge.
These findings underscore the potential of integrating deep learning-based molecular representations with MD-level accuracy in an active learning framework, offering a scalable and efficient path to accelerate virtual screening and improve hit identification in drug discovery.
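The acquisition loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional feature, the least-squares line standing in for Bayesian Ridge, the greedy acquisition rule, the toy oracle, and all function names are assumptions made for illustration. Lower scores are treated as stronger binders, as with binding free energies. Setting batch_size=1 refits the surrogate after every new label (one-at-a-time acquisition); a larger value gives batched acquisition.

```python
from typing import Callable, List, Sequence, Set, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept (stand-in surrogate)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        return 0.0, mean_y  # degenerate case: predict the mean everywhere
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def active_learning(
    features: Sequence[float],
    oracle: Callable[[int], float],
    seed: List[int],
    budget: int,
    batch_size: int = 1,
) -> Set[int]:
    """Greedy loop: refit, rank the unlabeled pool, label the best candidates."""
    labeled = {i: oracle(i) for i in seed}  # seed indices assumed unique
    target = min(len(seed) + budget, len(features))
    while len(labeled) < target:
        slope, intercept = fit_line(
            [features[i] for i in labeled], list(labeled.values())
        )
        pool = [i for i in range(len(features)) if i not in labeled]
        # Lowest predicted score first (most negative = strongest predicted binder).
        pool.sort(key=lambda i: slope * features[i] + intercept)
        for i in pool[: min(batch_size, target - len(labeled))]:
            labeled[i] = oracle(i)  # the expensive step (e.g. an MD-based score)
    return set(labeled)


def top_fraction_recovery(
    scores: Sequence[float], acquired: Set[int], fraction: float = 0.01
) -> float:
    """Share of the true top-`fraction` compounds that ended up in `acquired`."""
    k = max(1, round(len(scores) * fraction))
    top = sorted(range(len(scores)), key=lambda i: scores[i])[:k]
    return sum(1 for i in top if i in acquired) / k


# Toy data: 200 hypothetical compounds whose score is exactly linear in the feature.
features = [i / 200 for i in range(200)]
scores = [2.0 * x - 1.0 for x in features]
acquired = active_learning(
    features, lambda i: scores[i], seed=[50, 150], budget=10, batch_size=1
)
recovery = top_fraction_recovery(scores, acquired)
```

On real data the oracle would be the costly scoring step and the surrogate a regression model over molecular embeddings; the recovery metric is what the abstract reports as the percentage of top 1% binders found.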

17
CRD: a De novo Design algorithm for prediction of Cognate Protein Receptors for small molecule ligands

Sankar, S.; Chandra, N.

2023-04-02 bioinformatics 10.1101/2023.03.30.534983 medRxiv
Top 0.1%
14.1%

While predicting a new ligand to bind to a protein is possible with current methods, the converse of predicting a receptor for a ligand is highly challenging, except for very closely related known protein-ligand complexes. Predicting a receptor for any given ligand will be path-breaking in understanding protein function, mapping sequence-structure-function relationships and for several aspects of drug discovery including studying the mechanism of action of phenotypically discovered drugs, off-target effects and drug repurposing. We use a novel approach for predicting receptors for a given ligand through de novo design combined with structural bioinformatics. We have developed a new algorithm, CRD, with multiple modules that combine fragment-based sub-site finding, a machine learning function to estimate the size of the site, a genetic algorithm that encodes knowledge on protein structures and a physics-based fitness scoring scheme. CRD has a pseudo-receptor design component followed by a mapping component to identify possible proteins that house the site. CRD is designed to cater to ligands with known and unknown complexes. CRD accurately recovers sites and receptors for several known natural ligands including ATP, SAM, Glucose and FAD. It designs similar sites for similar ligands, yet to some extent distinguishes between closely related ligands. More importantly, CRD correctly predicts receptor classes for several drugs such as penicillins and NSAIDs. We expect CRD to be a valuable tool in fundamental biology research as well as in the drug discovery and biotechnology industry.

18
Small molecule bioactivity benchmarks are often well-predicted by counting cells

Seal, S.; Dee, W.; Shah, A.; Zhang, A.; Titterton, K.; Cabrera, A. A.; Boiko, D.; Beatson, A.; Puigvert, J. C.; Singh, S.; Spjuth, O.; Bender, A.; Carpenter, A. E.

2025-04-30 bioinformatics 10.1101/2025.04.27.650853 medRxiv
Top 0.1%
12.7%

Phenotypic profiling methods, such as Cell Painting and gene expression, have been widely used to predict compound bioactivity, often showing improvement over predictive models based on chemical structures alone. We discovered that a large subset of assays in widely used benchmark datasets either directly relate to cell health and cytotoxicity or are assays intending to capture a more specific phenotype but whose active compounds impact cell count, while inactives do not. As a result, counting cells can achieve predictive performance similar to that of Cell Painting or gene expression data. Filtering benchmarks to include only assays relating to protein targets reveals that Cell Painting can capture information that cannot be predicted by mere cell counting. We re-evaluated three benchmark datasets used with Cell Painting data and observed that, in many cases, cell count models produced an AUC comparable to models using the full Cell Painting profiles. However, in protein-target-specific benchmarks across 17 distinct protein targets, Cell Painting features demonstrated unique predictive power, outperforming mean balanced accuracy from cell count models with a relative improvement of 19.6%. We propose five practical recommendations for benchmarking machine learning models for predicting bioactivity, including using cell count as a baseline feature. Although multi-class classification applications (such as matching samples based on their morphological profile) are less likely to be predictable by cell count than bioactivity benchmarks, these recommendations are broadly applicable to machine learning for drug discovery.
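The recommendation to use cell count as a baseline can be sketched as follows. This is an illustrative assumption, not the authors' pipeline: the threshold rule, the toy cell counts and labels, the function names, and the definition of relative improvement as (model - baseline) / baseline are all hypothetical choices made for the example.

```python
from typing import List, Sequence


def cell_count_baseline(cell_counts: Sequence[float], threshold: float) -> List[int]:
    """Predict 'active' (1) when the well's cell count falls below the threshold."""
    return [1 if count < threshold else 0 for count in cell_counts]


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for binary labels (1 = active)."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (true_pos / positives + true_neg / negatives)


def relative_improvement(baseline: float, model: float) -> float:
    """Fractional gain of a model's score over the baseline's score."""
    return (model - baseline) / baseline


# Toy assay: three actives that reduce cell count, five inactives (made-up numbers).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
cell_counts = [120, 300, 150, 900, 850, 400, 950, 1000]

baseline_pred = cell_count_baseline(cell_counts, threshold=500)
baseline_ba = balanced_accuracy(labels, baseline_pred)  # (3/3 + 4/5) / 2 = 0.9

# A hypothetical profile-based model scoring 0.95 on the same assay.
gain = relative_improvement(baseline_ba, 0.95)
```

Reporting a richer model's score next to this single-feature baseline shows how much of its performance is explained by cell count alone.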

19
BioBrigit, A Hybrid Deep Learning and Knowledge-based Approach to Model Metal Pathways in Proteins: Application to a Di-Copper Tyrosinase

Marechal, J.-D.; Sodupe, M.; Sanchez, J. E.; Fernandez Diaz, R.; Roldan Martin, L.

2024-09-22 bioinformatics 10.1101/2024.09.19.613875 medRxiv
Top 0.1%
12.7%

The interaction of metallic species with proteins has been fundamental in evolution and key in many physiological processes. How metals bind to proteins also holds promise in many fields, like the design of new biocatalysts or the fight against pathogens. Nonetheless, the mechanism by which proteins recruit metal ions is far from understood, and uncovering it is one of the challenges in bioinorganic chemistry and structural biology. Computational methods are potentially among the most promising tools for this endeavor. Only a handful of efficient structural predictors of metal binding sites exist to date. Most focus on identifying the most stable binding sites in the protein scaffolds. Although these methods are very interesting, they do not consider the exploration of transient, sub-optimal binding sites that could be relevant in metal binding pathways in proteins. At the far end of modeling capabilities nowadays, we introduce BioBrigit, a hybrid deep learning and knowledge-based approach that suggests metal binding pathways in proteins. To demonstrate the method's viability, we apply it to the di-copper tyrosinase from Streptomyces castaneoglobisporus, a system for which crystallographic experiments allowed the identification of a series of transient sites of the copper in its path from a chaperone to the final catalytic site. Combined with homology modeling and large-scale molecular dynamics, BioBrigit allows for computational characterization of all experimental sites and for better understanding of the copper recruitment mechanism. BioBrigit appears as an asset in a field full of unknowns like metal binding to proteins and opens the way to further algorithms in this area. Source code, documentation, and data are available at https://github.com/insilichem/BioBrigit

20
DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery

Liu, T.; Jiang, S.; Zhang, F.; Sun, K.; Head-Gordon, T.; Zhao, H.

2026-04-07 bioinformatics 10.64898/2026.04.04.716470 medRxiv
Top 0.1%
12.5%

Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery pipelines. However, there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations over traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text-based descriptions of physicochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations for justifying the predictions of LLMs, thereby testing LLMs for chemical and biological reasoning capabilities to push their greater use at the frontier of drug discovery at all of its stages.